Skip to content

Conversation

@Max-Cheng
Copy link
Contributor

@Max-Cheng Max-Cheng commented Nov 13, 2025

Description

This PR implements page-level size limits for Parquet file reading to prevent
out-of-memory errors caused by extremely large pages.

Background:

Why page-level instead of column-level:

  • Validation occurs when reading page headers in ParquetColumnChunkIterator,
    before page objects are created
  • Applies uniformly to all page types (DataPageV1, DataPageV2, DictionaryPage)
  • Catches oversized pages earlier, preventing memory allocation for invalid data
  • More aligned with Parquet's actual structure where pages are the basic read unit

Changes:

  • Added maxPageSizeInBytes parameter to ParquetColumnChunkIterator to validate
    uncompressed page size
  • Throws ParquetCorruptionException when a page exceeds the configured limit
  • Added session property parquet_max_page_size and config property
    parquet.max-page-read-size (default: 500MB) for:
    • Hive connector
    • Iceberg connector
    • Delta Lake connector
  • Added unit test TestPageReader.testPageSizeLimit() to verify the validation
    logic

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Hive, Delta, Hudi, Iceberg
* Avoid reading unusually large pages from parquet files. The configuration property `parquet.max-page-read-size` can be used to limit the maximum allowed size of a parquet page during reads. Files with parquet pages larger than this will generate an exception on read.
  ({issue}`27148`)

@cla-bot cla-bot bot added the cla-signed label Nov 13, 2025
@github-actions github-actions bot added iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels Nov 13, 2025
@Max-Cheng Max-Cheng requested a review from sopel39 November 13, 2025 06:39
@Max-Cheng Max-Cheng force-pushed the draft/parquet_column_size_limit branch 2 times, most recently from d85e703 to 4544755 Compare November 13, 2025 08:32
@wendigo
Copy link
Contributor

wendigo commented Nov 13, 2025

Please remove feat: from commit message. It is not convention we use

@Max-Cheng
Copy link
Contributor Author

Please remove feat: from commit message. It is not convention we use

Ok.

@Max-Cheng Max-Cheng changed the title feat:allow user set parquet_max_page_size/parquet.max-page-size to avoid large page cause node OOM Allow user set parquet_max_page_size/parquet.max-page-size to avoid large page cause node OOM Nov 13, 2025
@Max-Cheng Max-Cheng force-pushed the draft/parquet_column_size_limit branch from 4544755 to 04989ed Compare November 13, 2025 12:09
Copy link
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % @raunaqmorarka PTAL too

@sopel39
Copy link
Member

sopel39 commented Nov 13, 2025

one comment

@Max-Cheng Max-Cheng force-pushed the draft/parquet_column_size_limit branch 2 times, most recently from 5fc07bc to 51b8cdb Compare November 13, 2025 16:13
@Max-Cheng Max-Cheng force-pushed the draft/parquet_column_size_limit branch from 51b8cdb to 0e5b68e Compare November 17, 2025 08:13
Copy link
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments

@Max-Cheng Max-Cheng force-pushed the draft/parquet_column_size_limit branch from 0e5b68e to 600cce9 Compare November 17, 2025 15:25
@github-actions github-actions bot added the docs label Nov 17, 2025
@raunaqmorarka raunaqmorarka changed the title Allow user set parquet_max_page_size/parquet.max-page-size to avoid large page cause node OOM Avoid reading unusually large parquet pages Nov 17, 2025
Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, lgtm otherwise

@raunaqmorarka
Copy link
Member

Please squash the commits, we don't need separate commit for adding docs entry

Reading unusually large parquet pages can lead to workers
going into full GC and crashing.
This change adds a guard rail to fail reads of such files gracefully.
@Max-Cheng Max-Cheng force-pushed the draft/parquet_column_size_limit branch from 600cce9 to e9095ab Compare November 18, 2025 12:18
@raunaqmorarka raunaqmorarka merged commit b2ccb28 into trinodb:master Nov 18, 2025
68 checks passed
@github-actions github-actions bot added this to the 479 milestone Nov 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed delta-lake Delta Lake connector docs hive Hive connector iceberg Iceberg connector

Development

Successfully merging this pull request may close these issues.

5 participants